Support quoted identifiers in Iceberg partitioning#12227
Conversation
There was a problem hiding this comment.
a column name can contain a quotation mark itself (").
in SQL, it is denoted by repeating the character "a column with "" quotation mark"
There was a problem hiding this comment.
It was indeed only partially implemented. I have adjusted the regex and added additional test cases. Build failure doesn't seem related java.util.concurrent.TimeoutException: Idle timeout 5000 ms. Build was green locally.
b0e3b42 to
c9f8fc8
Compare
|
Another place where the parsing of the partition field fails is the following: |
c9f8fc8 to
0537951
Compare
|
The currently supported syntax would be: Without the quotes we currently fail as it has to comply with the standard identifier regex: The reasoning behind would be that the array contains valid SQL strings that obey to the standard SQL spec. |
There was a problem hiding this comment.
Can you add a comment for each of the non-trivial regex patterns?
There was a problem hiding this comment.
I think with @findepi's great suggestion, that regex is now a lot simpler. It was indeed a bit convoluted.
|
I am not yet convinced we need quotes at all, for Do we envision the partition transforms to represent a language, e.g. have nested expression-like structure? |
From a user perspective I think the second one is a lot more obvious, clearly separating arguments in standard SQL syntax. Omitting the quotes will also not let us distinguish between quoted identifiers vs normal identifiers, which might get us in trouble once we truely support quoted identifiers. In the future I would say this is new feature, so it should conform to the SQL identifier specs as mentioned by @kasiafi in #11163 (comment) |
We don't need to. Partitioning specification is a mini-language and doesn't need to follow SQL identifier semantics (which are not as simple as they could be)
I agree. OTOH it's a fair price for putting commas in a column name. It's a bad idea and nothing will change that. Anyway, sans apostrophes it hurts my eyes, so yeah, let's go with quotes |
There was a problem hiding this comment.
IDENTIFIER -> UNQUOTED_IDENTIFIER
NAME -> IDENTIFIER
There was a problem hiding this comment.
| private static final String QUOTED_IDENTIFIER = "\"[^\"]*(?:(?:\"\")+[^\"]*)*\""; | |
| private static final String QUOTED_IDENTIFIER = "\"(?:\"\"|[^\"])*\""; |
0537951 to
5c085f2
Compare
|
@mdesmet So: should fail, but: should pass. I am only a bit concerned about the case. Before this change, only identifiers which were fully in lowercase would pass |
Good point.
I don't want SQL semantics here, let's be simpler. Let's require user to provide the exact same case as the column actually has (regardless of how it was created). -- should work
CREATE TABLE t(x bigint) WITH (partitioning = ARRAY['x']);
CREATE TABLE t(x bigint) WITH (partitioning = ARRAY['"x"']);
-- should work, the column is actually `x`, not `X`
CREATE TABLE t(X bigint) WITH (partitioning = ARRAY['x']);
-- should work, the column is actually `x`, not `X` (until #17)
CREATE TABLE t("X" bigint) WITH (partitioning = ARRAY['x']);
-- should fail, there is no column `X`. Until #17, the column is actually `x`.
CREATE TABLE t("X" bigint) WITH (partitioning = ARRAY['"X"']); |
I suggest that instead we require that users pass only lowercase names (through a check in Comparing case-sensitive seems like something we might not want to do right now. Currently, we resolve column names case-insensitive, so let's be consistent. |
This is effectively what i want. |
|
Noted conclusion under #12226 (comment) |
|
@mdesmet please add test cases as indicated in #12227 (comment) |
ba315b8 to
384baa2
Compare
There was a problem hiding this comment.
Link to #17
please add -- or better: remove the comment here, leaving only the one at fromIdentifier
There was a problem hiding this comment.
the [^A-Z] isn't sufficient for that.
What about Ą?
simplify regex, and use .toLowerCase(ENGLISH) in Java to verify parsed value is all-lower for now
There was a problem hiding this comment.
Regex has been simplified and verification is done using .toLowercase(ENGLISH).
There was a problem hiding this comment.
When should i call createPartitionSpec and when parsePartitionFields?
they have similar names, and even more similar semantics.
Also, how does introduction of the wrapper (that throws TrinoException) related to adding quoted identifiers?
the parsing could fail even before the change, right? (eg unsupported transform, missing closing brace, etc)
There was a problem hiding this comment.
I removed the wrapper and just catch the exception in parsePartitionFields now and rethrow as TrinoException. This makes the testing easier in BaseConnectorTest as we are expecting TrinoException there,
5577dab to
80b5e05
Compare
|
Following test failed in last build. It's actually related to #12626. I have setup the product tests to run with |
b95953f to
2dd644f
Compare
There was a problem hiding this comment.
It seems like SIMPLE_NAME tries to achieve the same as UNQUOTED_IDENTIFIER but not exactly following SQL semantics. Changing it had some impacts in exception checking in the tests.
There was a problem hiding this comment.
What is the reasoning behind adding this setting for the DevelopmentServer Iceberg connector?
There was a problem hiding this comment.
See my comment #12227 (comment)
We can never really ensure that all cleanup (finally stuff) is run eg. For example in case of Hive timeouts some table location might not be emptied, so I think this is the best way to handle this as this ensures every table will have its unique location. This can be moved to a separate PR if necessary.
There was a problem hiding this comment.
This is a bummer.
We should provide in Trino a way for the users to cope with such situations because otherwise the users would face a Spark lock-in for such situations.
I've created a PR in iceberg to address this problem. apache/iceberg#5110
There was a problem hiding this comment.
To give you some context. The original PR I did handle this correctly. According SQL quoted identifier semantics, a quoted identifier should be matched case sensitively. The backticks in Spark behave the same as the quoted identifiers in SQL. IMHO this is not an issue with Iceberg.
There was some discussion about this feature as indeed in Trino the column names are converted into lowercase. Take for example this query
CREATE TABLE t("X" bigint) WITH (partitioning = ARRAY['"X"']);
Because the column name is converted to lowercase in Trino, this query would fail, as at that time the "X" has become x and the partitioning parsing logic fails to find this column. This is definitely confusing for the user. In a ALTER TABLE scenario however this is not true. The column will be known as "X". So we agreed on blocking this scenario for now, as mentioned on #12226 (comment)
Here is the code that explicitly blocks this, removing the lowercase verification would fix the Trino query above.
There was a problem hiding this comment.
IMHO this is not an issue with Iceberg.
I think that the PR apache/iceberg#5110 is more of a usability "improvement" and not a bug.
Please correct me if I'm wrong, I don't think it should be mandatory to specify the source column name in the same case in the table definition.
There was a problem hiding this comment.
I think we better stick to SQL semantics.
Imagine following query perfectly valid syntax (not currently working in Trino, but actually working on snowflake):
CREATE TABLE t("X" bigint, "x" bigint) WITH (partitioning = ARRAY['"X"']);
Because of the quoted identifers we would know which x to match. I would think Iceberg partitioning spec should be exactly matched against the Iceberg schema and not impose a certain way of working.
This apparently also rebased on current master. |
The build failed because of the refactor of To summarise the changes:
Let me know what you think. |
What kind of problem is this fixing? (btw this will become default in #12941, so we don't want to set this explicitly in product tests) |
With #12626, we throw an exception when trying to create a table and files exist on that location. Sometimes tests fail randomly and are retried without paths being cleaned. In this case
Anyway if this setting becomes default (which I definitely support), this is not an issue anymore. Will remove that commit. |
2dd644f to
89b4d46
Compare
There was a problem hiding this comment.
That changes semantics of the method.
previously, for My_Table we would output "My_Table".
now we output My_Table without quotes.
if the table name is actually My_Table, it needs to be referenced as "My_Table" in SQL,
so the output of this command no longer can be pasted into SQL.
Please revert the change here
There was a problem hiding this comment.
You indeed point out a bug, that also applies to the fromColumnToIdentifier method. It is however the same semantics: we are taking something from metadata (a column or table name), and need to ensure that it can be pasted in an SQL editor, respecting SQL identifier semantics.
There was a problem hiding this comment.
The method introduced here is unused (in the commit which introduced it), and it's unclear what's the context in which is should be used.
Squash the changes with the next commit.
There was a problem hiding this comment.
The method introduced here is unused (in the commit which introduced it), and it's unclear what's the context in which is should be used.
Squash the changes with the next commit.
There was a problem hiding this comment.
Later it becomes used in PartitionFields which is the context in which it's comprehensible.
(otherwise the "identifier" will be misleading, as normally String identifier should not be expected to have any quotes inside (or the quotes be treated literal).
Move the method to PartitionFields
There was a problem hiding this comment.
The placement in IcebergUtils had been discussed in #12872, the need to parse quoted identifiers also exist in other table properties (sort_order, orc_bloom_filter, ...).
89b4d46 to
5675f13
Compare
|
(cannot comment at #12227 (comment))
I am aware of that PR.
It's important for a shared method to have an easy to understand semantics (name, input and output types need to intuitively hint and what it does). |
5675f13 to
1ea29fc
Compare
|
@mdesmet failures seem related. |
e69642d to
95d16a9
Compare
95d16a9 to
18cd9b3
Compare
I have rebased with latest master and resolved the issues. |

Description
Adds support for quoted identifiers in Iceberg partitioning.
Trino Iceberg allows tables to be created using quoted identifiers.
CREATE TABLE test AS SELECT 1 as "a quoted identifier";However when a partitioning property is added these columns can't be declared.
CREATE TABLE test WITH(partitioning=ARRAY['a quoted identifier']) ...fails with errorInvalid partition field declaration: a quoted identifierFix
Change to Iceberg connector
Related issues, pull requests, and links
resources. For example:
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: